© Scott Robison 2021 all rights reserved.
For Chapter 11: Linear Models and Estimation by Least Squares, we will cover the same material as the textbook, though likely not in the same order. I will list the page range for the whole of Chapter 11 but will go at my own pace and in my own order, so please reference the textbook at your convenience.
Chapter 11, pages 563-609, from the text.
After discovering covariance between bivariate data, you will likely want to know how to describe/express the relationship. Successful descriptions can then be used to model expected results for one of the variables, deemed the response variable \(Y\), based only on the predictor variable \(X\).
The steps are:
Collect bivariate X and Y variables from historic events. This data set will serve as “training” to understand the linear relationship that exists between the variables.
Develop a mathematical expression/equation to transform a particular/hypothetical \(X\) into an expected/estimated \(Y\).
Consider “bivariate” data expressing temperature in \(^{\circ} C\), degrees Celsius (call this the \(X\) variable), and in \(^{\circ} F\), degrees Fahrenheit (call this the \(Y\) variable).
What do you notice about the scatterplot?
Do you see how all the points fall exactly on the line? This is called a deterministic model; since all points fall exactly on the line, we can perfectly predict \(Y\) for any \(X\), even where no observed points exist.
The linear equation will follow the deterministic model’s form:
\[\begin{align} Y_i=\beta_0+\beta_1 X_i,&& i=1,2,…,n\\ \end{align}\]
where \(\beta_0\) is the y-intercept (the \(Y\) value when \(X=0\)) and \(\beta_1\) is the slope (rate of change in \(Y\) with respect to \(X\)).
In a deterministic model, only two sample points are required to find the model:
\[\begin{align} \beta_1=\frac{rise}{run}=\frac{y_2-y_1}{x_2-x_1},&& \text{ then }&\beta_0=y_1-\beta_1 x_1\\ \end{align}\]
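For instance, the Celsius/Fahrenheit relationship above can be recovered from just two known points (the freezing and boiling points of water are used here as an illustration):

```r
# Two known (Celsius, Fahrenheit) pairs
x1 <- 0;   y1 <- 32    # freezing point of water
x2 <- 100; y2 <- 212   # boiling point of water

beta1 <- (y2 - y1) / (x2 - x1)   # slope = rise/run
beta0 <- y1 - beta1 * x1         # y-intercept

beta1                  # 1.8, i.e. 9/5
beta0                  # 32
beta0 + beta1 * 37     # 37 degrees C maps to 98.6 degrees F
```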
Of course, in the “real” world we often lack the ability to measure variables with deterministic precision. We expect response and/or measurement bias in our observations. Additionally, when dealing with random variables, we know that our observations may lack consistency without any bias at all! Let’s return to our example of ten students’ heights and weights, and try to select two points from the data set and create linear models…
So which model of a non-deterministic data set is best?… The probabilistic model appears to be the same as the deterministic model; however, it includes an additional \(\varepsilon_i\) term:
\[\begin{align} Y_i=\beta_0+\beta_1 X_i+\varepsilon_i,&& i=1,2,…,n,&& \text{ where }\varepsilon_i\sim Norm(0,\sigma^2) \end{align}\]
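To see what this model means in practice, here is a small simulation sketch (the parameter values \(\beta_0=10\), \(\beta_1=2\), \(\sigma=3\) are made up for illustration): each \(Y_i\) is the deterministic line plus a \(Norm(0,\sigma^2)\) error.

```r
set.seed(1)                              # reproducible example
n     <- 50
beta0 <- 10; beta1 <- 2; sigma <- 3      # hypothetical "true" parameters
X     <- runif(n, 0, 20)
eps   <- rnorm(n, mean = 0, sd = sigma)  # the random error term
Y     <- beta0 + beta1 * X + eps         # probabilistic model

plot(X, Y)                               # points scatter around the line...
abline(beta0, beta1, col = "red")        # ...the underlying deterministic part
```

Unlike the temperature example, these simulated points do not fall exactly on the line, yet the linear trend is unmistakable.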
The probabilistic model can also be written this way, to “hide” the \(\varepsilon_i\) term by admitting the following values are estimates:\[\begin{align} \widehat{Y}_i=\widehat{\beta}_0+\widehat{\beta}_1 X_i,&& i=1,2,…,n\\ \end{align}\]
Let \(\widehat{Y}\) be the probabilistic model that is the closest “overall” to the sample points, meaning the set of sample points \((X_i,Y_i )\)’s that are respectively closest to the \((X_i,\widehat{Y}_i )\)’s. Then we define the difference between \(Y_i\) and \(\widehat{Y}_i\) to be \(\varepsilon_i\).
Then \(\varepsilon_i=Y_i-\widehat{Y}_i\) (residuals/errors/residual errors); we require \(\sum_{i=1}^n \varepsilon_i =\sum_{i=1}^n(Y_i-\widehat{Y}_i)=0\) so we can find the model with the least overall error!
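This zero-sum property is easy to check numerically; a quick sketch using the ten students’ height/weight data that appears later in these notes:

```r
# The ten students' heights (X) and weights (Y) from the example below
X <- c(63,64,66,69,69,71,71,72,73,75)
Y <- c(127,121,142,157,162,156,169,165,181,208)

fit <- lm(Y ~ X)       # least-squares fit
sum(residuals(fit))    # essentially 0, up to floating-point rounding
```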
One complication that comes up from setting the sum of errors equal to zero is that some of the errors are now negative, so positive and negative errors cancel each other out.
To overcome this we square the individual error terms and then discuss the SUM of SQUARED ERRORS,
\[SSE=\sum_{i=1}^n\varepsilon_i^2 =\sum_{i=1}^n(Y_i-\widehat{Y}_i)^2 =\sum_{i=1}^n\big(Y_i- (\widehat{\beta}_0+\widehat{\beta}_1 X_i)\big)^2\]
We wish to estimate the probabilistic model in such a way that the sum of the squares of these vertical distances is as small as possible, a method known as least-squares estimation. Consider the sum of the squared distances/errors, SSE, that each bivariate data point lies away from the fitted line, \(\widehat{Y}=\widehat{\beta}_0+\widehat{\beta}_1 X\).
We need to minimize SSE with respect to \(\widehat{\beta}_0\) and then with respect to \(\widehat{\beta}_1\) by finding the partial derivatives \(\frac{\partial SSE}{\partial\widehat{\beta }_i}\), where \(i=0,1\), and setting them equal to zero.
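Before looking at the closed-form answers, we can sanity-check the idea by minimizing SSE numerically (a sketch using `optim()` on the height/weight data from later in these notes; the starting values are arbitrary):

```r
# Height/weight data from the worked example below
X <- c(63,64,66,69,69,71,71,72,73,75)
Y <- c(127,121,142,157,162,156,169,165,181,208)

# SSE as a function of the pair b = (b0, b1)
SSE <- function(b) sum((Y - (b[1] + b[2] * X))^2)

# Numerical minimization of SSE over (b0, b1)
est <- optim(c(0, 0), SSE, method = "BFGS",
             control = list(reltol = 1e-12))$par
est                 # close to the closed-form least-squares estimates
coef(lm(Y ~ X))     # intercept about -266.5, slope about 6.14
```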
We will see:
The least-squares estimate of the \(Y\)-intercept of the model is:
\[\widehat{\beta}_0=\overline{Y}-\widehat{\beta}_1\overline{X}\]
The least-squares estimate of the slope of the model is:
\[\begin{align} \widehat{\beta}_1&=\frac{S_{XY}}{S_{XX}} =\frac{S_{XY}}{S_X S_X }=\frac{S_{XY}}{S_X^2 }=\frac{r S_Y}{S_X} =\frac{\sum_{i=1}^n\left[(X_i-\overline{X} )(Y_i-\overline{Y })\right] }{\sum_{i=1}^n(X_i-\overline{X})^2 }\\ &=\frac{\sum_{i=1}^n X_i Y_i -n\overline{X}\,\overline{Y} }{\sum_{i=1}^n(X_i-\overline{X})^2 }=\frac{\sum_{i=1}^n X_i Y_i -n\overline{X}\,\overline{Y} }{\sum_{i=1}^n X_i^2-n \overline{X }^2 }\\ \end{align}\]
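These formulas are straightforward to apply directly; a sketch using the height/weight data from the example below:

```r
# Height/weight data from the worked example below
X <- c(63,64,66,69,69,71,71,72,73,75)
Y <- c(127,121,142,157,162,156,169,165,181,208)

Sxy <- sum((X - mean(X)) * (Y - mean(Y)))  # sum of cross-deviations
Sxx <- sum((X - mean(X))^2)                # sum of squared X-deviations

b1 <- Sxy / Sxx                  # least-squares slope
b0 <- mean(Y) - b1 * mean(X)     # least-squares intercept
c(b0, b1)

# Equivalent slope via the correlation and sample standard deviations
cor(X, Y) * sd(Y) / sd(X)        # same value as b1
```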
Let’s look back at our student height and weight data, found in the data file in D2L. We use height as the predictor and weight as the response.
X <- c(63,64,66,69,69,71,71,72,73,75)
Y <- c(127,121,142,157,162,156,169,165,181,208)
Student <- 1:10
reg1 <- data.frame("Student ID"=Student,height=X,weight=Y)
reg1
cor(reg1$height,reg1$weight)
## [1] 0.9470984
cor(reg1$height,reg1$weight)^2
## [1] 0.8969953
plot(reg1$height,reg1$weight)
fit <- lm(weight~height, data=reg1)
abline(fit)
fit
##
## Call:
## lm(formula = weight ~ height, data = reg1)
##
## Coefficients:
## (Intercept) height
## -266.534 6.138
fit$coefficients[2]
## height
## 6.137581
predict(fit,data.frame(height=2.5*12))
## 1
## -82.40695
predict(fit,data.frame(height=67))
## 1
## 144.6836
157-predict(fit,data.frame(height=69))
## 1
## 0.04127444
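Although not shown in the output above, the SSE of this fit, and from it the usual estimate \(s=\sqrt{SSE/(n-2)}\) of \(\sigma\), can be computed directly; it matches the “residual standard error” that `summary(fit)` reports:

```r
# Rebuild the height/weight fit from the example above
X <- c(63,64,66,69,69,71,71,72,73,75)
Y <- c(127,121,142,157,162,156,169,165,181,208)
fit <- lm(Y ~ X)

SSE <- sum(residuals(fit)^2)   # sum of squared errors
n   <- length(Y)
s   <- sqrt(SSE / (n - 2))     # residual standard error on n - 2 df
s
summary(fit)$sigma             # R's built-in value; same number
```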
Please try this on your own before looking at the solution
How strong is the linear relationship between the age of a driver and the distance the driver can see? A research firm (Last Resource, Inc., Bellefonte, PA) collected data on a sample of \(n = 30\) drivers. What can you say about the relationship?
.csv file [right-click and choose “Download Linked File” or “Save Link As”], then let’s import it into R.
Import .csv file
fit <- lm(Distance~Age, data = File)
summary(fit)
##
## Call:
## lm(formula = Distance ~ Age, data = File)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.231 -41.710 7.646 33.552 108.831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 576.6819 23.4709 24.570 < 2e-16 ***
## Age -3.0068 0.4243 -7.086 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.76 on 28 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.6292
## F-statistic: 50.21 on 1 and 28 DF, p-value: 1.041e-07
cov(File$Distance,File$Age)
## [1] -1425.862
cor(File$Distance,File$Age)
## [1] -0.8012447
plot(Distance~Age,data = File)
curve(fit$coefficients[1]+fit$coefficients[2]*x,add=T,col="red")
Please try this on your own before looking at the solution